RELIABILITY - Simin Saatian
RELIABILITY

Reliability of a test is an important selling point for publishers of standardized tests, especially in high-stakes testing. If an institute asserts that its instrument can identify children who qualify for a special education program, the users of that test would hope that it has high reliability. Otherwise, some children who require special education may not be identified, whereas others who do not require it may be unnecessarily assigned to the program.

In the test publishing world, the reliability of a draft test instrument is often quickly established. However, after the test has been used many times, its validity might be questioned. A test designed as a verbal reasoning test may rely heavily on the test taker's knowledge of music, art, and history. Because the test is reliable, the publishers might redefine what construct it is measuring; in this case, it is a better measure of the students' knowledge of the humanities than of verbal reasoning. The publisher would recall all copies of the Verbal Reasoning test; then, with little change to the test, the publisher could offer it again as the Humanities Achievement test.

Test reliability is explained through the true score theory and the theory of reliability. True score theory states that the observed score on a test is the result of a true score plus some error in measurement. The theory of reliability compares the reliability of a test of human characteristics with the reliability of measuring instruments in the physical sciences.

THE THEORY OF RELIABILITY

In summary, according to reliability theory, reliability is equal to the ratio of the variance of the true score to the variance of the observed score. Calculating the ratio of the estimated variance of the true score to the variance of the observed score is the same as calculating the correlation between two observed scores.
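The relationship summarized above can be written out as a short derivation. This is a sketch under the usual classical test theory assumptions, which the text does not spell out: the error E has mean zero and is uncorrelated with the true score T, and two repeated measures X_1 and X_2 share the same true score, with errors that are uncorrelated with each other and have equal variance. The symbols are chosen here for illustration.

```latex
\begin{align}
X &= T + E \\
\sigma_X^2 &= \sigma_T^2 + \sigma_E^2 \\
\text{reliability} &= \frac{\sigma_T^2}{\sigma_X^2} \\
\operatorname{Corr}(X_1, X_2)
  &= \frac{\operatorname{Cov}(T + E_1,\; T + E_2)}{\sigma_{X_1}\,\sigma_{X_2}}
   = \frac{\sigma_T^2}{\sigma_X^2}
\end{align}
```

The last line holds because, under these assumptions, the only covariance shared by the two measures is the variance of the true score, and both measures have the same observed variance.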
Therefore, the correlation of two repeated measures of the same test is accepted as an appropriate estimate of the reliability of the test.

TYPES OF RELIABILITY

Inter-rater (or inter-observer) reliability is an important consideration in the social sciences because there are many conditions for which the best means of measurement is the report of trained observers. Some performances, such as gymnastics routines, can only be assessed through the ratings of expert judges. As another example, external observers may be brought into a classroom to assess a student's inappropriate behavior.

The observations of a single observer can be challenged from many points of view. A lone observer may have personal expectations of what is supposed to occur. The lone observer may also get tired or bored, so that earlier observations are more precise than later ones. The reports of two or more observers are less likely to be challenged, and their acceptability increases when their observations are similar. The measure of the similarity of observations coming from two or more sources is the inter-rater reliability.

One method of establishing inter-rater reliability is to calculate the proportion of agreement between or among the observers. This is appropriate if the ratings or observations fall into mutually exclusive categories. The two observers recording the student's inappropriate behavior would do well to share a common checklist of the likely behaviors. If they agree on the occurrence of 16 out of 20 behaviors, their inter-rater reliability would be 16/20, or 80 percent.

Another method is to calculate the correlation between the ratings of the two or more observers. This is possible if the ratings or observations are two or more sets of interval numbers. The gymnastics judges would give different ratings. Some may consistently rate high while others consistently rate low.
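The two calculations described above can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed procedure: the observer checklists and judge scores are invented, and the helper names `proportion_agreement` and `pearson_r` are hypothetical.

```python
from math import sqrt
from statistics import mean


def proportion_agreement(ratings_a, ratings_b):
    """Share of observations on which two raters chose the same category."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("both raters must rate the same non-empty set of items")
    matches = sum(a == b for a, b in zip(ratings_a, ratings_b))
    return matches / len(ratings_a)


def pearson_r(x, y):
    """Pearson correlation between two equally long lists of interval scores."""
    if len(x) != len(y):
        raise ValueError("both lists must have the same length")
    mean_x, mean_y = mean(x), mean(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when a list has no variance")
    return cov / sqrt(var_x * var_y)


# Invented checklist data: 1 = behavior observed, 0 = not observed (20 behaviors).
observer_1 = [1, 1, 0, 1, 0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 0, 1]
observer_2 = [1, 1, 0, 1, 0, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1]
agreement = proportion_agreement(observer_1, observer_2)
print(f"proportion of agreement: {agreement:.2f}")  # 16 of 20 match -> 0.80

# Invented gymnastics scores: judge_b is consistently stricter than judge_a.
judge_a = [9.5, 8.75, 9.0, 7.5, 8.25]
judge_b = [9.0, 8.0, 8.5, 7.0, 7.5]
r = pearson_r(judge_a, judge_b)
print(f"correlation between judges: {r:.2f}")
```

With these invented checklists the observers agree on 16 of 20 behaviors, matching the 80 percent example in the text. In the invented judge scores, the second judge scores every routine lower than the first, so the raw scores themselves disagree.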
However, there should be some general agreement on the ranking of the different performers. The strength of this agreement would be reflected in the correlation of their ratings.

Inter-rater reliability is increased if the observers have appropriate training. The training should focus on exactly what is meant to be observed. The raters need to be given a clear description of the event to be observed. The classroom observers would need to know what is and is not appropriate behavior. The raters also need concrete examples of what constitutes an occurrence or what constitutes achievement at each criterion level. The gymnastics judges need to know the standards for each element of the gymnastics routine. Training is best when it includes much practice with feedback.

Test-retest reliability is appropriate for tests that measure a construct that is not likely to change. The construct that intelligence tests measure is not expected to change. Another well-known test whose construct is expected to remain stable is the Scholastic Aptitude Test (SAT). Although a test taker is allowed to take the SAT up to three times, the developers claim that the score on repeated administrations will not change. The construct that the SAT is measuring is the predicted adaptability to college. By the time students take the SAT, they are as prepared for college as they are going to be.

Test-retest reliability is described as the correlation between the distribution of scores on one administration and the distribution of scores on a subsequent administration. Test-retest reliability is also an important factor in some experimental designs in which the treatment group is administered a pretest and a posttest with the treatment in between, and the control group receives only the pretest and the posttest.
Any analysis of the difference between the treatment group's posttest and pretest results is confounded unless there is a strong correlation between the pretest and posttest scores of the control group.

Parallel-forms reliability evaluates the consistency of the results of two tests constructed in the same manner from the same content domain. For every item on the test, a similar item is developed with the same difficulty level. The items from each pair are then randomly assigned to one form of the test or the other. The resulting two tests are the same in content and difficulty but not in wording. The reliability is described as the correlation of the two distributions of scores. This type of reliability is important in the development of standardized tests.

Split-half reliability is similar to parallel forms except that the two forms are both incorporated into one test. After the test is administered, the items are divided into the two forms, each form is scored, and the correlation between the two distributions of scores is calculated. Like parallel forms, it is important in the development of standardized tests. However, it could have classroom applications if the classroom teacher were willing to make the effort to develop a test with twice as many items as an ordinary test. In the classroom, split-half reliability could detect the effect of students' guessing on the test.

Inter-item reliability is another means of evaluating the reliability of one administration of one test. Most tests are made up of items that are related to one another because they are measuring similar concepts. Because the items are similar in design, there should be a measurable correlation between the items in any pair. The evaluation of inter-item reliability begins with predicting the correlations between all pairs of items. The inter-item reliability is expressed as the proportion of correct predictions.
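The split-half and inter-item calculations can be sketched as follows. This is an illustration under stated assumptions rather than a standard procedure: the item scores are invented, the odd/even split is only one possible way to divide the items into two forms, and the function names are hypothetical. The sketch computes only the observed correlations; how predictions about them are made and checked is not specified in the text and is left out.

```python
from itertools import combinations
from math import sqrt
from statistics import mean


def pearson_r(x, y):
    """Pearson correlation between two equally long lists of scores."""
    mean_x, mean_y = mean(x), mean(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when a list has no variance")
    return cov / sqrt(var_x * var_y)


def split_half_correlation(item_scores):
    """Correlate each test taker's total on one half with the total on the other.

    item_scores holds one row per test taker and one score per item.
    Assumption: items at even positions form one half, odd positions the other.
    """
    first_half = [sum(row[0::2]) for row in item_scores]
    second_half = [sum(row[1::2]) for row in item_scores]
    return pearson_r(first_half, second_half)


def inter_item_correlations(item_scores):
    """Return {(i, j): r} for every pair of items, using 0-based item indices."""
    n_items = len(item_scores[0])
    columns = [[row[i] for row in item_scores] for i in range(n_items)]
    return {
        (i, j): pearson_r(columns[i], columns[j])
        for i, j in combinations(range(n_items), 2)
    }


# Invented data: 6 test takers, 4 items, scored 1 (right) or 0 (wrong).
scores = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 0, 1, 1],
    [0, 1, 0, 0],
    [0, 0, 0, 1],
    [0, 0, 0, 0],
]

print(f"split-half correlation: {split_half_correlation(scores):.2f}")
for pair, r in inter_item_correlations(scores).items():
    print(f"items {pair}: r = {r:.2f}")
```

In this invented data set, a pair of items with a correlation near zero or below zero would be a candidate for closer inspection.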
A classroom teacher might want to use inter-item reliability to identify the items that were not related to any other items or to identify the effects of students' guessing.

Cronbach's Alpha and the Kuder-Richardson methods are ways of reporting the internal consistency of a test. The essential results of the internal consistency methods are comparable to the average of all correlations between all pairs of items. These methods can estimate reliability using the results of only one administration of the test. The main difference between the two approaches is how the items are scored. Cronbach's Alpha can be used on items with a range of responses, such as a Likert scale. The Kuder-Richardson methods require that all items be scored dichotomously, as right or wrong (Borg and Gall, 1983).

PROCEDURES TO INCREASE RELIABILITY

The general goal in increasing the reliability of a measure is to increase the variance due to true differences among test takers while reducing the error variance. Three recommended procedures to accomplish this are: 1) decrease the ambiguity of the test items; 2) increase the number of items per objective; and 3) provide clear test-taking instructions (Kerlinger, 1986).

If an item is ambiguous, it can be interpreted in more than one way. Two test takers of equal ability could conceivably interpret an ambiguous item in two different ways, one getting it right and the other getting it wrong. Their scores would differ because of their interpretations of the item and not because of differences in true ability.

Where there is error in a test item, it will have less effect if that item is one among many for the same objective than if it is one among few. A test taker whose ability is mismeasured by a faulty item needs the items that measure more accurately to offset the effect of that item.

Clear test-taking instructions help test takers to interpret the test items correctly and to indicate their chosen answers properly.
Test instructions might remind the test takers of the types of items that require special attention. In addition, if there is a special procedure for answering, such as using an answer sheet, test instructions can remind test takers how to respond correctly.

References

Borg, W. R., & Gall, M. D. (1983). Educational research: An introduction (4th ed.). White Plains, NY: Longman.
Kerlinger, F. N. (1986). Foundations of behavioral research (3rd ed.). Fort Worth, TX: Holt, Rinehart and Winston.